SWE-bench Verified
The ability to autonomously complete software engineering tasks is a key component of our Medium risk level in the Model Autonomy risk category.
Coding agents have made impressive progress on SWE-bench: as of August 5, 2024, the top-scoring agents reach 20% on SWE-bench and 43% on SWE-bench Lite, according to the SWE-bench leaderboard.
Some SWE-bench samples are too hard or impossible to solve
scikit-learn__scikit-learn-14520
Note that the agent is only given the problem description from the main issue text, and does not have visibility into the tests that it needs to pass. Given this setup, it would be nearly impossible for an agent to solve this sample in SWE-bench.
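The setup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual SWE-bench harness: the `Sample` class, the `agent_view` and `is_resolved` helpers, and the example test name are all hypothetical. The sketch shows why a sample can be nearly unsolvable: the agent sees only the issue text, while success is judged by tests it never sees.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Sample:
    """Hypothetical stand-in for one benchmark sample (not the real schema)."""
    instance_id: str
    problem_statement: str  # issue text: the only thing the agent sees
    hidden_tests: list[str] = field(default_factory=list)  # never shown to the agent


def agent_view(sample: Sample) -> dict[str, str]:
    """Return only the information the agent is allowed to see."""
    return {"problem_statement": sample.problem_statement}


def is_resolved(sample: Sample, test_results: dict[str, bool]) -> bool:
    """A sample counts as solved only if every hidden test passes.

    A test that is missing from the results is treated as a failure.
    """
    return all(test_results.get(name, False) for name in sample.hidden_tests)


if __name__ == "__main__":
    sample = Sample(
        instance_id="scikit-learn__scikit-learn-14520",
        problem_statement="(issue text goes here)",
        hidden_tests=["test_hypothetical_behavior"],  # made-up test name
    )
    print(agent_view(sample))  # contains no test information
    print(is_resolved(sample, {"test_hypothetical_behavior": False}))
```

If the hidden tests check for behavior that the issue text never mentions, no amount of careful reading of `problem_statement` lets the agent make `is_resolved` return true, which is the failure mode the note describes.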